A Lasso for Hierarchical Interactions 



Jacob Bien * 



Jonathan Taylor t 



Robert Tibshirani * 



February 1, 2013 



Abstract 



We add a set of convex constraints to the Lasso to produce sparse interaction models that honor 
the hierarchy restriction that an interaction only be included in a model if one or both variables are 
marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove 
that hierarchy holds with probability one, and derive an unbiased estimate for the degrees of freedom of 
our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. 

We distinguish between parameter sparsity — the number of nonzero coefficients — and practical spar- 
sity — the number of raw variables one must measure to make a new prediction. Hierarchy focuses on 
the latter, which is more closely tied to important data collection concerns such as cost, time, and effort. 
We develop an algorithm, available in the R package hierNet, and perform an empirical study of our 
method. 

Keywords: Regularized regression, lasso, interactions, hierarchical sparsity, convexity 

1 Introduction. 

There are numerous situations in which additive (main effects) models are insufficient for predicting 
an outcome of interest. In medical diagnosis, the co-occurrence of two symptoms may lead a doctor to be 
confident that a patient has a certain disease whereas the presence of either symptom without the other 
would provide only a moderate indication of that disease. This situation corresponds to a positive (i.e., 
synergistic) interaction between symptom variables. On the other hand, suppose both symptoms convey 
redundant information to the doctor about the patient so that knowing both provides no more information 
about the disease status than either one on its own. This situation is again not additive, but this time 
there is a negative interaction between symptoms. Fitting regression models with interactions is challenging 
when one has even a moderate number, p, of measured variables, since there are (£) interactions of order k. 
For this paper, we focus on the case of pairwise (k = 2) interaction models, although the ideas we develop 
generalize naturally to higher-order interaction models. 
1.1 Two-way interaction model. 

We consider a regression model for an outcome variable Y and predictors X\, . . . , X p , with pairwise 
interactions between these predictors. In particular our model has the form 



where e <~ N(0, a 2 ). Regardless of whether the predictors are continuous or discrete, we will refer to the 
additive part as the "main effect" terms and the quadratic part as the "interaction" terms. Our goal is to 
estimate (3 <E W and O e M pxp , where O = T and Qjj = 0. The factor of one half before the interaction 
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summation is a consequence of our notational decision to deal with a symmetric matrix O of interactions 
rather than a vector of length p(j> — l)/2. We take Qjj = throughout this paper because it simplifies 
notation, but everything carries over if we remove this restriction and we provide this as an option in the 
hierNet (pronounced "hair net") package. 

We observe a training sample, (xi,yi), [x n , y n ), and our goal is to select a subset of thep+p(p— 1)/2 
main effect and interaction variables that is predictive of the response, and to estimate the values for the 
nonzero parameters of the model. 
1.2 Strong and weak hierarchy. 

It is a well-established practice among statisticians fitting ([!]) to only allow an interaction into the 
model if the corresponding main effects are also in the model. Such restrictions are known under various 



names, including "heredity," "marginality," and being "hierarchically well-formulated" (Hamada& Wu 1992 



Chipman 


1996 


Ncldcr 


1977 



and weak hierarchy: 

Strong Hierarchy: @ jk ^ fa ^ and (3 k ^ 
Weak Hierarchy: Q jk ^ fa ^ or f3 k ^ 0. 

Some statisticians argue that models violating strong hierarchy are not sensible. For example, according to 



McCullagh & Nelder (1983) 



"[T]here is usually no reason to postulate a special position for the origin, so that the linear terms 
must be included with the cross-term." 

To see that violating strong hierarchy amounts to "postulating a special position for the origin," consider 
writing an interaction model as Y = (3q + (f3i + Qi 2 X 2 )Xi + ■ ■ • . First of all, we would only take (3q = if we 
have special reason to believe that the regression surface must go through the origin. Likewise, taking /3i = 
but 0i2 7^ would only be appropriate if we actually believe that X±'s effect on Y should only be present 
specifically when X 2 is nonzero. In most situations, we do not think that the variable X 2 that we measured 
is any more special than aX 2 + b. Yet if our model with X 2 violates strong hierarchy, then our model with 
aX 2 + b (for any b 7^ 0) is strongly hierarchical. This argument suggests that violations to hierarchy occur 
in special situations whereas hierarchy is the default. 



Another argument in favor of hierarchy has to do with statistical power. In the words of Cox (1984): 



"[L]arge component main effects are more likely to lead to appreciable interactions than small 
components. Also, the interactions corresponding to larger main effects may be in some sense of 
more practical importance." 

In other words, rather than looking at all possible interactions, it may be useful to focus our search to those 
interactions that have large main effects. Indeed, the method we propose in this paper makes direct use of 
this principle. 

As a final argument for hierarchy, it is useful to distinguish between two notions of sparsity, which we 
will call parameter sparsity and practical sparsity. Parameter sparsity is what most statisticians mean by 
"sparsity" : the number of nonzero coefficients in the model. Practical sparsity is what someone actually 
collecting data cares about: the number of variables one needs to measure to make predictions in the future. 
The hierarchy restriction favors models that "reuse" measured variables whereas a nonhierarchical model 
does not. The left panel of Figure [T] gives a small example where this difference is manifest. In fact, a 
simple calculation shows that this difference can be quite substantial: we can have a hierarchical and a 
nonhierarchical interaction model with the same parameter sparsity but with the nonhierarchical method 
having a practical sparsity of k(k + 1) where the hierarchical method's practical sparsity is just k. 

While taking these arguments to the extreme leads to the use of strong hierarchy exclusively, we develop 
the case of weak hierarchy in parallel throughout this paper. Weak hierarchy, as the name suggests, can 
be thought of as a compromise between strong hierarchy and imposing no such structure and appears as a 
principle in certain statistical methods such as classification and regression trees ( Breiman et al.|[l984 ) and 
multivariate additive regression splines ( Friedman]| 1991 ). 
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1.3 Sparsity, the Lasso, and Structured Sparsity. 



The Lasso ( |Tibshirani|1996 l is a method that performs both model selection and estimation. It penalizes 
the squared loss of the data with an ^-norm penalty on the parameter vector. This penalty has the property 
of producing estimates of the parameter vector that are sparse (corresponding to model selection). Given a 
design matrix X € M. nxd and response vector y £ M. n , the Lasso is the solution to the convex optimization 
problem: 



Minimize — lly- 
Po,4> 2 l|y 



p i-X4>\\ 2 + \\\4>\\ 



The penalty parameter, A > 0, controls the relative importance of fitting to the training data (sum-of-squares 
term) and of sparsity (l\ penalty term) . A natural extension of the Lasso to our interaction model ([I]) would 
be to take (f> T = [f3 T , vec(9) T ] and X = (X ; Z/2), where the columns of Z E M™*?^- 1 ) correspond to 
elementwise products of the columns of X. We will refer to this method as the All-Pairs Lasso since it is 
simply the Lasso applied to a data matrix which includes all pairs of interactions (as well as all main effects) . 
It is common with the Lasso to standardize the predictors so that they are on the same scale. In this paper, 
we standardize X so that its columns have mean and standard deviation 1; we then form Z from these 
standardized predictors, and finally center the resulting columns of Z. By centering y and X, we may take 

A> = o. 

The Lasso's l\ penalty is neutral to the pattern of sparsity, allowing any sparsity pattern to emerge. 
The notions of strong and weak hierarchy introduced in Section |1.2| represent situations in which we want 
to exclude certain sparsity patterns. There has been a growing literature focusing on methods that produce 



structured sparsity (Yuan & Lin 2006 Zhao et al. 2009 Jenatton et al. 2010 2011| Bach|[20lT Bach et al 



2012). These methods make use of the group lasso penalty (and generalizations thereof), which, given a 

3 to be set i 
penalty by 



predetermined grouping of the parameters, induce entire groups of parameters to be set to zero (Yuan & Lin 
2006). Given a set of groups of variables, Q, these methods generalize the l\ 



d GUG\\ 1G , 

where jg > L is 4> projected onto the coordinates in G, and do is a nonnegative weight. Hierarchical 



structured sparsity is obtained by choosing Q to have nested groups. For example, Zhao et al. (2009) consider 
the penalty 

^{16^1 + 11(^,^,6^)11^}. 



Likewise, the framework of Bach et al. (2012) if specialized to this paper's focus would lead to a penalty of 
the form 



|e||i + E^IK e i'^)ll 



(2) 



for some q > 1 and dj > 0. In fact, Radchenko & James (2010) suggest a penalty for generalized additive 



models with interactions that reduces to (|2j) in the linear model case, with q = 2 and dj independent of j 
1.4 This paper. 

Here, we propose a lasso-like procedure that produces sparse estimates of (3 and O while satisfying the 
strong or weak hierarchy constraint (Section [2]). In contrast to much of the structured sparsity literature 
which is based on group lasso penalties, our approach, presented in Section[2j involves adding a set of convex 
constraints to the Lasso. Although we find this form of constraint more naturally interpretable, we show 
(Remark 3) that this problem can be equivalently expressed in a form that relates it to penalties from the 
structured sparsity literature such as |2]). 

A key advantage of our specific choice of penalty structure is that it admits a simple interpretation of the 
effect of the hierarchy demand. Unlike other hierarchical sparsity methods, which do not pay much attention 
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to the particular choice of norms (as long as 7g > 1), our formulation is carefully tailored to allow it to be 
related directly back to the Lasso, permitting one to understand specifically how hierarchy alters the solution 
(Section 3.1). This feature of our estimator gives it a transparency that exposes the effects (both positive 
and negative!) of the hierarchy constraint. Furthermore, our characterization suggests that the demand for 
hierarchy is — analogous to the demand for sparsity — a form of "regularization." We develop an unbiased 
estimator of the degrees of freedom of our method (Section |3.3[ ) and an interpretable upper bound on this 
quantity, which also points to hierarchy as regularization. In particular, we show that we do not "spend" in 
degrees of freedom for main effects that are forced into the model by the hierarchy constraint. 

Another difference from much of the structured sparsity literature, which aims to develop a broad treat- 
ment of structured and hierarchical sparsity methods, is that our focus is narrowed to the problem of 
interaction models. Our restricted scope allows us to address specifically the performance of such a tool to 
this important problem. In Section [4j we review previous work on the problem of hierarchical interaction 
model fitting and selection. These methods fall into three categories: Multi-step procedures, which are 
defined by an algorithm (|Peixoto||1987[ |Friedman||1991[ |Turlach||2004"l |Nardi k Rinaldo||2007[ |Bickel et al 



2008 Park & Hastie 2008 



through a prior ( Chipman 



Wu et al. 2010); Bayesian approaches, which specify the hierarchy requirement 



1996); and, most related to this paper's proposal, regularized regression methods, 



which are defined by an optimization problem ( Yuan et al.|2009 Zhao et al.|2009 Choi et al.|2010 Jenatton 
et al. 2010 Radchenko &: James||2010 |. In Section |5| we study via simulation the statistical implications 
of imposing hierarchy on an interactions-based estimator under various scenarios (in both the Lasso and 
stepwise frameworks). In Section[6]we present an efficient algorithm for computing our estimator. Real data 
examples are used to illustrate a distinction we draw between "parameter sparsity" and "practical sparsity" 
and to discuss hierarchy's role in promoting the latter. 

2 Our proposed method. 



In Section |1.3| we introduced the All-Pairs Lasso, which can be written as 

A 



Minimize 

/3 gr, /3grp, eei 



g(/3 ,/?,e) + A|!/3|| 1 + -||e|! 1 s.t. e = e J 



(3) 



where ||G||i = J2j^k \®jk\ and <l(Po,P,&) is the loss function, typically \ J2i=i(Vi ~ Po ~ x f P ~ \ x J® x i) 2 = 
— /Jol — Xj3 — Zvec(9)/2|| 2 , but may also include a ridge penalty on the coefficients as discussed later or 
may be substituted for the binomial negative log-likelihood. The one-half factors in front of terms involving 
are merely a consequence of the notational choice to represent O as a symmetric matrix (with Qjj = for 
j = 1, . . . ,p). In this paper, we propose a modification of the All- Pairs Lasso that produces models that are 
guaranteed to be hierarchical. 

As motivation for our proposal, consider building hierarchy into the optimization problem as a constraint: 



Minimize q(/3 , p, 0) + All/31 

s.t. e = e T , ii 



A 



©Hi 



(4) 



Qjh < I Pi I for 3=1, 



,p, 



where Qj denotes the jth row (and column, by symmetry) of O. Notice that if Qjk ^ 0, then ||0j||i > 
and ||Bfe||i > and thus [3j ^ and fik 0. While the added constraints enforce strong hierarchy, they are 
not convex, which makes Q undesirable as a method. In this paper, we propose a straightforward convex 
relaxation of Q, which we call the Strong Hierarchical Lasso: 



Minimize 

Po&R, p ± eR p 
eeR pXp 



q(l3o,P + -P~,Q) + Al T (/?+ +/T) + p0||i 



(5) 



s.t. e = e J 



\ej\u<pf + pj 

Pt>0,pr>o 



for j = 1, . 
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where we have replaced the optimization variable (3 £ M. p by two vectors f3 + ,f3~ £ W. After solving the 
above problem, our fitted model is of the form f(x) — $o+x T — f3~)+x T <dx/2. While we might informally 
think of f3 + and j3~ as positive and negative parts of a vector (3 = f3 + — /3~ , i.e. that ^F- = max{±/3, 0}, this 
is not actually the case since at a solution we can have both j3j > and j3~ > 0. Indeed, if we were to add 
the constraints /?+ ' (3J = for j = 1, ... ,p to ([5]), then these would be positive and negative parts and so 
flf + (3~ = \(3j\, giving us precisely problem Q. This observation establishes that ^ is a convex relaxation 
of (|J). 

The hierarchy constraints can be seen as an embedding into our method of David Cox's "principle" that 
"large component main effects are more likely to lead to appreciable interactions than small components." 
The constraint 

l|e,|| 1 </3+ + /37 

budgets the total amount of interactions involving variable Xj according to the relative importance of Xj as 
a main effect. One additional advantage of the convex relaxation is that the constraint is less restrictive. If 
the best fitting model would have ||Oj||i large but \j3j \ only moderate, this can be accommodated by making 
fit and j3~ both large. 

Remark 1: Another possibility for the hierarchy constraint that we have considered is \@jk\ < + Pj\ 
however, we have found that this can lead to an overabundance of interactions relative to main effects. 

Remark 2: It is desirable to include in the loss function q an elastic net term, (e/2) (||0|||* + ll/3 + l| 2 + l| 2 ) i 
to ensure uniqueness of the solution ( Zou fc Hastie|2005 1 . We think of e > as a fixed tiny fraction of A, such 



as e = 10 A, rather than as an additional tuning parameter. Such a modification does not complicate the 
algorithm, but simplifies the study of the estimator. In all numerical examples and in the hierNet package, 
we use this elastic net modification. 

Remark 3: We prove in Appendix [A] that ^ may equivalently be written as 

Minimize q(j3 ,/3,Q) + A V max{|/3, |, + ^||9||i. (6) 

eeR pxp ,e=e T 3 

This reparameterization of the problem shows its similarities to the group lasso based methods. In place of 
the more standard penalty ||(Oj,/3j)|| 9 of pL we use max{||6j||i, |/3j|}. In Section 3.1 



we show that this 



unusual choice of penalty admits a particularly simple interpretation for the effect of imposing hierarchy. 



In Section |1.2| we also introduced the notion of weak hierarchy. By simply removing the symmetry 
constraint on 0, we get what we call the Weak Hierarchical Lasso: 

Minimize q(0 o , /?+ - /T, 6) + Al T (/3+ + /T) + £ ||e||i (7) 

f 110,11! </3+ + /3-l . 
s.t. . > for 7 = 1, . . . ,p. 

#>0,/3r> J 

Even though at a solution to this problem, is not symmetric, we should think of the interaction coefficient 
as (&jk + 0/cj)/2 since this is what multiplies the interaction term XijXik when computing f(x{). 

Remark 4: We can build further on the connection between ^ and ^ discussed in Remark 3. Our removal 
of the symmetry constraint in ( |7|) is analogous to the technique of duplicating columns of the design matrix 
used in the overlap group lasso ( Obozinski et al.||2011 ). 



A favorable property that distinguishes our method from previous approaches discussed in Section [4] is 
the relative transparency of the role that the hierarchy constraint plays in our estimator. This aspect is 



developed in Section 3.1 
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Although our primary focus in this paper is on the Gaussian setting of ([I]) , our proposal extends straight- 
forwardly to other situations, such as the logistic regression setting in which the response is binary. In this 
case, we simply have q((3o, ft, @) be the appropriate negative log-likelihood: — Y17=i V* ^°SPi + 0- ~ Vi) l°g(l ~ 
Pi), where Pi = [1 + cxp(— /3 — xjff — \xf Qxi)]~ l . In Appendix PR we show that to solve this problem 
requires only a minor modification to our primary algorithm. It should also be noted that our estimator 
(and the algorithms developed to compute it) is designed for both the p < n and p > n setting. 

As a preliminary example, consider predicting whether a sample of olive oil comes from Southern Apulia 
based on measurements of the concentration of p = 8 fatty acids ( Forina et al.| [l983). The dataset consists 



of n — 572 samples, and we average our results over 100 random equal-sized train-test splits. We compare 
three methods: (a) a standard Lasso with main effects only (MEL), (b) the All-Pairs Lasso (APL), and (c) 
the Strong Hierarchical Lasso (HL). 

The top left panel of Figure [l] shows an interesting difference between HL and APL. We see that, on 
average, at a parameter sparsity level of five, the HL model uses four of the measured variables whereas APL 
uses six. Using the hierarchical model to classify a future olive oil, we only need to measure four rather than 
six of the fatty acids. 

The top right panel of Figure [l] shows the predictive performance (versus the practical sparsity) of the 
three methods. It appears that HL enjoys the "best of both worlds," matching the good performance of MEL 
for low practical sparsity levels (since it tends to pick out the main effects first) and the good performance of 
APL at high practical sparsity levels (since it can incorporate predictive interactions). Finally, the bottom 
panel of the figure provides a visual display of a sequence of HL's solutions (by varying A). Nonzero main 
effects are shown as filled nodes and edges indicate nonzero interactions. Since all edges are incident to filled 
nodes, we see that strong hierarchy holds. 

In the next section, we present several properties of our estimator that shed light on the effect of adding 
the convex hierarchy constraint to the Lasso. Among these properties is an unbiased estimate of the degrees 
of freedom of our estimator. We view this degrees of freedom result as valuable primarily for the sake of 
understanding the effect of hierarchy. While such an estimate could be used for parameter selection, we 
prefer cross validation to select A since this is more directly tied to the goal of prediction. 

3 Properties. 



3.1 Effect of the constraint. 

A key advantage of formulating an estimator as a solution to a convex problem is that it can be com- 
pletely characterized by a set of optimality conditions, known as the Karush-Kuhn-Tucker (KKT) conditions. 
These conditions are useful for understanding the effect that the hierarchy constraint in ^ and Q has on 
our solutions. In this section, we will study the simplest case, taking q(pa,j3, 0) to be the quadratic loss 
function with no elastic net penalty. We let 

r^^ =y -y + Xjfij 

r (-M =y _$ + ( x .* Xk )(e jk + e kj )/2 

denote partial residuals (where * denotes elementwise multiplication), and we assume that ||xj||2 = 1- For 
linear regression, the KKT conditions are known as the normal equations and can be written as 

fa = xfr(-fi 
The All-Pairs Lasso solution satisfies 

P s =S(a$r<-fl, A) 

where S denotes the soft-thresholding operator defined by S(c, A) = sign(c)(|c| — A)+. Written this way, 
we see that the Lasso is similar to linear regression, but all coefficients are shrunken towards 0, with some 



9j* = 



(xj * Xfe) T r( J ' fe ) 

\\ x j *%k\\ 2 



S[{Xj*x k ) T r^ k \ A] 
\\x« * x k \\ 2 



(8) 
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Figure 1: Olive oil data: (Top left) Parameter sparsity is the number of nonzero coefficients while practical 
sparsity is the number of measured variables in the model. Results from all 100 random train-test splits are 
shown as points; lines show the average performance over all 100 runs. (Top right) Misclassification error on 
test set versus practical sparsity. (Bottom) Wheel plots showing the sparsity pattern at 6 values of A for the 
Strong Hierarchical Lasso. Filled nodes correspond to nonzero main effects and edges correspond to nonzero 
interactions. 
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coefficients (those for which \xJA i'\ < 0) set to zero. It is instructive to examine the corresponding 
statements for the Strong and Weak Hierarchical Lasso methods. 

Property 1. The coefficients of the Strong and Weak Hierarchical Lassos with A > and taking q(f3o, (3, 0) 
to be the quadratic loss (with no elastic net penalty) satisfy 

• Strong: 

Pf-Pj =S(xJr^\ A - &j) 

g _ S[( Xj *x k ) T A-i k \ A + aj+a fe ] 

"jk — n ii9 ' 

\\Xj * x k \\ 2 

• Weak: 

Pj-Pj =S(xjr^\ A -a 3 ) 
Qjfc + Qfcj _ S[(xj * x k ) T r < ~~ :lk \ A + 2min{a i ,q fc }] 

for some ctj > 0, j = 1, .. . ,p with ctj = when ||0j||i < pj + $J (and likewise for ctj). 

Proof. See Appendix [B] □ 

The ctj , ctj appearing in the above two properties are optimal dual variables corresponding to the jth 
hierarchy constraint for the Strong and Weak Hierarchical Lasso problems, respectively. When ||0j||i < 
{Sj + (3~ , we have Stj — (or ctj — 0) by complementary slackness. Comparing these expressions to those 
of the All-Pairs Lasso gives insight into the effect of the constraint. Property [l] reveals that the overall form 
of the All-Pairs Lasso and Hierarchical Lasso methods are identical. The difference is that the hierarchy 
constraint leads to a reduction in the shrinkage of certain main effects and an increase in the shrinkage of 
certain interactions. In particular, we see that when the hierarchy constraints are loose at the solution, i.e. 
II® j 111 < Pj + Pj ■> the Weak Hierarchical Lasso's optimality conditions become identical to the All-Pairs 
Lasso (since ctj — 0) for all coefficients involving Xj. For the Strong Hierarchical Lasso, when both the jth 
and kth constraints are loose, the optimality conditions match those of the All-Pairs Lasso for the coefficients 
of X j, Xfc, and Xj * x k . The methods differ when constraints are active, i.e. when |]0j||i — pf + Pa , which 
allows ctj (or ctj) to be nonzero. Intuitively, this case corresponds to the situation in which hierarchy would 
have not held "naturally" (i.e. without the constraint) and the corresponding dual variable plays the role 
of reducing Qj in £i-norm and increasing p!~ + (3~ until the constraint is satisfied. The way in which the 
Weak and Strong Hierarchical Lasso methods perform this shrinkage is different, but both are identical to 
All-Pairs Lasso when all constraints are loose. 
3.2 Hierarchy guarantee. 

In Section[2j we showed that adding the constraint ||0j||i < \Pj\ would guarantee that hierarchy holds. 
However, we have not yet shown that the same is true of the convex relaxation's constraint, ||Oj ||i < Pj + ftj ■ 

In particular, while Qjk ^ (3^ + (ij ^ 0, we could still have $f — /3j~ = 0. This would correspond 

to a model in which XjX k is used in the model, but Xj is not. Intuitively, we would expect that if ftj > 0, 
then pT — pj is analogous to getting an exact zero in linear regression (i.e. a zero probability event). In 
this section, we establish that this is in fact the case. 

In particular, we study ([5| and Q where g(/?0j P ©) includes an elastic net term. The importance of this 
modification is that it ensures uniqueness, simplifying the analysis. As noted in Remark 2, we think of e as 
a small, fixed proportion of A rather than as a separate tuning parameter. 

Property 2. Suppose y is absolutely continuous with respect to the Lebesgue measure on M. n . If (/3 + , 0) 
solves (|5j), where q(Po, A©) is the quadratic loss with an e > ridge penalty, then strong hierarchy holds 
with probability 1, i.e. 

0,fc ? =>• Pj - Pj * AND p+ - p- ? 0. 
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Proof. See Appendix [Cj □ 

To understand how dropping the symmetry constraint leads to the "or" statement required of weak 
hierarchy, note that XjX k is in the Weak Hierarchical Lasso model if and only if Qj k + Ofcj 7^ 0. This holds 
only if Qj k 7^ or Q k j 7^ 0. 

Property 3. Suppose y is absolutely continuous with respect to the Lebesgue measure on M. n . If (/3 + , P~, O) 
solves ([7]), where q(Po, P, O) is the quadratic loss with an e > ridge penalty, then weak hierarchy holds with 
probability 1, i.e. 

(e Jfc + e kj )/2 ? =► pf - pj ^ or p+ - p k ? 0. 

Proof. See Appendix [Cj □ 

3.3 Degrees of Freedom. 

In classical statistics, the degrees of freedom of a procedure refer to the dimension of the space over 
which its fitted values can vary. It is useful in that it provides a measure of how much "fitting" the procedure 
is doing. This notion can be generalized to adaptive procedures such as the Lasso ( Stein|[l98T Efron|[l986 



Efron et al.|2004||Zou et al.|2007[ ). See (Ryan) |Tibshirani fc Taylor] ( |2012[ ) for a thorough discussion. If given 



data y £ R n , a procedure h produces fitted values y = h(y) £ K", the degrees of freedom of the procedure h 
is defined to be 

1 - 

<V(h) = -^^2,cav{y i ,y i ). (9) 
i=i 

Property 4. Suppose y ~ N(ii,a 2 I n ). An unbiased estimate of the degrees of freedom of the Strong Hier- 
archical Lasso, with quadratic loss and no ridge penalty, is given by 

df x = mnk(XP), 

where X = (X : —X : Z/2 : —Z/2) with Z containing the interactions, and P is a projection matrix which 
depends on the sign pattern of P~ , O) and on the set of hierarchy constraints that are tight. 

Proof. See Appendix |Dj □ 

Figure [i] provides a numerical evaluation of how well dfA estimates df>. We fix X £ M nxp , e R p , and 
O e W xp , and we generate B = 10,000 Monte Carlo replicates y^ x \ ■ ■ ■ € M". For each replicate, we 

fit the Strong Hierarchical Lasso along a grid of A values to get 0x > A\ j ®> , ) an d € M. n . From 
these values, we compute Monte Carlo estimates of dfA from the definition in ^ and of i?[dfx]. 

While df x can be calculated from the data and is therefore useful as an unbiased way of calibrating the 
amount of fitting the Strong Hierarchical Lasso is doing, this expression is difficult to interpret. However, it 
turns out that we can bound df \ by a quantity that does make more sense. 

Property 5. Let T = {j : = p+ + Pj} ; A p = {j : Pj - Pj £ 0}, A± - {j : Pf > 0}, and 

A e = {jk-.e jk ^0,j <k}. Then, 

df x < |^| + L4 e |-|Tn(.4 + A.4_)| 
holds almost surely, where A+AA- = {A+ \ A-) U (A— \ A+). 

Proof. See Appendix |P| □ 
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Actual df 

Figure 2: Numerical evaluation of how well di\ estimates di\. Monte Carlo estimates of £7 [dfx] (y-axis) 
versus Monte Carlo estimates of df^ (x-axis) for a sequence of A values (circular) are shown. One standard 
error bars are drawn and are hardly visible. Our bound on the unbiased estimate is plotted with diamonds. 



By contrast, for the All-Pairs Lasso in the case that p + (2) < n and the design matrix is full rank, we 
have df\(APL) — E[\Ap\ + |.4e|] (Zou et al. |2007[ ). In other words, the Strong Hierarchical Lasso does not 
"pay" (in terms of fitting) for those main effects, /3j~ — (3~ , that are forced into the model by the hierarchy 
constraint to accommodate a strong interaction. Notice that we do pay for a nonzero main effect if both /3+ 

and P~ are nonzero. This makes sense since the constraint could be satisfied with just one of these variables 
nonzero, but in this case it is advantageous to the fit to make both nonzero. In Figure [2j we find that this 
bound is in expectation visually indistinguishable from E[df x ]. 

4 Related Work. 

There has been considerable interest in fitting interaction models in statistics and related fields. We 
focus here on an overview of methods that aim at forming predictive models that satisfy the hierarchical 
interactions restriction. 
4.1 Multi-step procedures. 

Many statistics textbooks discuss a simple stepwise procedure in which one iteratively considers adding 
or removing the "best" variable (whether it be main effect or interaction); they add that one should only 



consider including an interaction if its main effects are in the model (e.g., see backward elimination in Agresti 



2002 Section 6.1.3). In doing so, they are enforcing the strong hierarchy restriction. Such procedures are 



ubiquitous ( Nelder|1997 Peixotofl987 ) as are more recent versions ( Friedman|1991 Bickel et al.|2008 Park 



& Hastie 2008 |Wu et al. 2010). Another approach is to perform model selection first without considering 



hierarchy and then to include any lower-order terms necessary to satisfy hierarchy as a post-processing step 
( |Nardi fc Rinaldo||2007[). F inally, |Turlach| ( |2004[ ) and |Yuan et al.] ( |2007[ ) consider modifying the LARS 
algorithm ( |Efron et al.||2004[ ) so that hierarchy is enforced. 
4.2 Bayesian approaches. 

Another set of procedures for building hierarchical interaction models comes from a Bayesian viewpoint. 
Chipman] ( |1996| ) adapt the Stochastic Search Variable Selection (SSVS) approach of George & McCulloch 



( 1993 ) to produce strong or weak hierarchical interaction models. SSVS makes use of a hierarchical normal 
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mixture model to perform variable selection in regression. Every variable has a latent binary variable 
indicating whether it is "active." Conditional on this latent variable, each coefficient is a 0-mean normal 
with variance determined by the latent importance of the coefficient. The original SSVS paper chooses a 
prior in which the importance of each variable is an independent Bernoulli, 
dependence into the prior so that Qjk is important only if f3j and/or f3k is as well. 
4.3 Optimization-based approaches. 



Chipman (1996) introduces 



|Choi et al. (2010) formulate a nonconvex optimization problem to get sparse hierarchical interaction 
models. They write Qjk = Tjkf3j/3k, where j3 are the main effect coefficients and then apply i\ penalties on 
(3 and T. Notice that Qjk 7^ implies /3j 7^ and /3k 7^ 0. The nonconvexity arises in writing Qjk as the 
product of optimization variables. 

Most similar to this paper's proposal are a series of methods which formulate convex optimization prob- 
lems to give sparse hierarchical interaction models. Yuan et al. (2009) modify the nonnegative garrote 



(Breiman 1995) by adding linear inequality constraints to enforce hierarchy. In this sense, our method can 



be seen as the adaption of their approach to the Lasso. 

Finally, as discussed in Sectio n |1.3[ another set of convex methods make use of the Group Lasso penalty 
(Yuan & Lin |2006 ). Zhao et al. (2009) (and, relatedly, Jenatton et al.|[20lT ) describe Composite Absolute 
Penalties (CAP), a very broad class of penalties that can achieve group and hierarchical sparsity. To 
achieve "hierarchical selection," they put forward the principle that a penalty of the form ||(</>i, </>2)|l7 + |<^i|, 
with 7 > 1, induces <f>2 to be zero only when (j>\ is zero as well. For hierarchical interaction models, they 
suggest a penalty of the form A X^ </c l"l®jfcl + II (§21 P k , Qjfc) II -n-.fcl ■ This fram ework has been developed in the 
structured sparsity literature (e.g., |Bach et al.|2012 ). jRadchenko fc James| ( |2010| ) introduce VANISH, which 



uses this nested-group principle to achieve hierarchical sparsity in the context of non-linear interactions. 
Their penalty in the setting of ([I]) is [Ai||(/3?, © j ) 1 1 2 + A2 1 1 Hi]- As noted in Remark 3, our proposal is 
closer to CAP and VANISH than it may first appear. Our problem can be rewritten to have a penalty of the 
form A [ max {l/3j'|j ll^jlli} + (1/2) || @j ||i] ■ In this sense, the penalty is in the spirit of CAP and related 
methods, although it does not quite fall into the class of CAP (since ours involves a sum of norms of norms). 
It is most similar to VANISH in that it combines all of Qj into the term involving j3j. 

5 Empirical Study. 



5.1 Simulations. 

Our main interest in this section is to study the advantages and disadvantages of restricting one's 
interaction models to those that honor hierarchy. Clearly, the effectiveness of such a strategy depends on 
the true model generating the data. We take n = 100 and p = 30 (435 two-way interactions) and consider 
four scenarios: 

(I) Truth is hierarchical: Q jk ^ =$> /3j ^ 0, j8 fc ^ 0. 

(II) Truth is anti-hierarchical: Qjk 7^ =>■ (3j = 0, /3k = 0. 

(III) Truth only has interactions: (3j — for all j. 

(IV) Truth only has main effects: Qjk = for all jk. 

In cases (I), (II), (IV), we set 10 elements of (3 to be nonzero (with random sign) and in cases (I), (II), (III), 
we set 20 elements of the submatrix of 6 = T to be nonzero. The signal-to-noise ratio (SNR) for the main 
effects part of the signal is about 1.5 whereas the SNR for the interactions part is about 1. 

We study the effectiveness of the hierarchy constraint in the context of both the Lasso and forward 
stepwise regression. Forward stepwise regression refers to a greedy strategy for generating a sequence of 
linear regression models in which we start with an intercept-only model and then at each step add the 
variable that leads to the greatest decrease in the residual sum of squares. We choose forward stepwise as 
a basis of comparison since it has a simple modification that we think may be the hierarchical interactions 
approach most commonly used by statisticians. The modification is to restrict the set of interactions that 
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could be added at a given step to only those between main effect variables currently in the model. A 
backward stepwise version of this approach is suggested in Peixoto (1987). 
We compare six methods, corresponding to each cell of the following table. 





Hierarchical All-Pairs Main Effects Only 


Lasso 
Fwd Stepwise 


HL (our method) APL MEL 
HF APF MEF 



Each method has a single tuning parameter: for the Lasso methods, the penalty parameter, A, and for the 
forward stepwise methods, the number of variables, k. We fit each method along a grid of tuning parameter 
values and select the model with the smallest mean square error, E\\y — fi\\ 2 . Note that such an operation 
is only possible in simulation since it requires knowing /i; however, doing so avoids the added variance of 
cross validation without being biased in favor of any particular method. The results presented are based on 
100 simulations from the underlying model. Figure [3] shows the expected prediction error, a 2 + E[(y — fj,) 2 ]. 
Panel (I) shows that when the truth is hierarchical, methods that assume hierarchy (HL, HF) do better than 
the rest. These methods have "concentrated" their power on the correct set of models and therefore receive 
the biggest payoff for being correct. APL does better than MEL and MEF since it succeeds in incorporating 
some of the correct interactions (recall that interactions make up one quarter of the signal). In Panel (II), we 
notice our first surprise — that HL predicts well relative to the others even when the truth is not hierarchical! 
We would have expected APL (or APF) to be the clear winner in this situation since surely the hierarchy 
assumption can only be detrimental in this "anti-hierarchical" scenario. The reason APL doesn't outperform 
HL in this scenario is because APL has trouble identifying the main effects (they get swamped by the 435 
interaction variables). In light of Section [3~T| this is where the hierarchy constraint helps — main effects are 
penalized by less and interactions by more. Even though APL is better able to find the correct interactions 
than HL, as seen in Panel (II) of Figure |4j APL does not predict as well as HL because it fails to find 
the main effects, which constitute three quarters of the signal. Relatedly, in a "hierarchical truth" scenario 
similar to (I) but with p > n (not presented here) we have in fact observed MEL doing better than APL 
(though not as well as HL) since APL is not able to detect interactions accurately enough to make up for 
its inferior ability to detect main effects. By contrast, HL does best in that scenario, aided by hierarchy to 
capture both the main effect and interaction components of the signal. 

In Panel (III), we see a situation where APL does dominate HL. Since there are no main effects in the 
signal, all that is relevant is a method's ability to find the interactions. HL identifies fewer correct interactions 
than APL since any main effect "information" that HL is using is spurious. Finally in Panel (IV), we see 
a situation where MEF, HF, MEL do better than the rest. Here again we find that the hierarchy methods 
beat the all-pairs methods since they favor main effects. 

It is particularly illuminating to note the difference in performance between HL and HF. HF in scenarios 
(II), (III), and (IV) performs very similarly to the main effect only models. In (II) and (III), HL does much 
better than HF both in terms of prediction error and in ability to correctly identify interactions. HL appears 
to be far less sensitive to violations of hierarchy than HF. This difference is attributable to the joint nature in 
which HL acts: the decision to include a main effect is made at the same time as decisions about interactions. 
This allows a strong interaction to "pull" itself into the model. By contrast, HF selects main effects with no 
regard to the information contained in the interactions. 
5.2 Data examples. 

Rhee et aT| ( |2006| ) study six nucleoside reverse transcriptase inhibitors (NRTIs) that are used to treat 
HIV-1. The target of these drugs can become resistant through mutation, and they compare a collection of 
models for predicting these drug's (log) susceptibility — a measure of drug resistance — based on the location 
of mutations. In the six cases, there are between p = 211 and p = 218 sites with mutations occurring in the 
n = 784 to n = 1073 samples. While they focus on main effect only models, we consider here the all-pairs 
lasso (APL) and weak hierarchical lasso (HL) in addition to the standard main effects lasso (MEL). We train 
on half of the samples and test on the remaining samples. To reduce the dependence of the results on the 
particular random training-test split, we repeat this process twenty times and average the results. Figure [5] 
shows the average test RMSE versus the average practical sparsity for each of the six drugs. In all cases but 
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(I) Hierarchical truth (II) Anti-hier. truth (III) Interactions only (IV) Main effects only 
(base=259.55) (base=247.51) (base=109.1) (base=201 .49) 




HL HF APL APF MEL MEF HL HF APL APF MEL MEF HL HF APL APF MEL MEF HL HF APL APF MEL MEF 



Figure 3: Prediction error: Dashed line shows Bayes error (i.e. a 2 ) and the base rate refers to the prediction 
error of Strain- Green, red, and blue colors indicate hierarchy, all-pairs, and main effect only, respectively; 
solid and striped indicate Lasso and forward stepwise, respectively. 



(I) Hierarchical truth 

Sensitivity Specificity 



(II) Anti-hier. truth 

Sensitivity Specificity 
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HL HF APL APF HL HF APL APF 



(III) Interactions only 

Sensitivity Specificity 
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HL HF APL APF 





Figure 4: Plots show the ability of various methods to correctly recover the nonzero interactions. This is 
the sensitivity (i.e., proportion of Ojfc for which Qjk ^ 0) and specificity (i.e., proportion of Qjk = for 
which Qjk = 0) corresponding to the lowest prediction error model of each method. 
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Figure 5: HIV drug data: Test-set RMSE versus practical sparsity (i.e., number of measured variables 
required for prediction) for six different drugs. For each method, the data from all 20 runs are displayed in 
faint colors; the thick lines are averages over these runs. 



ABC, we find that HL achieves a better test error at most levels of practical sparsity than APL. That said, 
if the number of mutations one has to measure is not of concern (so that we can choose for each method 
the minimum RMSE model), then no method dominates in all the situations. It is worth conceding — since 
this is a paper on interactions — that in several of the cases a pure main effects model appears to be the best 
option. 

6 Algorithmics. 

Some of the fastest Lasso solvers rely on coordinate descent, which amounts to iteratively applying ^ 
until convergence ( |Friedman efaI][20To| . |Tseng| ( |200T| proves that blockwise coordinate descent converges 



to the global minimum for a convex problem specifically when the nondifferentiable part of the problem is 
blockwise separable. In the case of the Strong Hierarchical Lasso, the hierarchy constraints combined with the 
symmetry constraint couples all the parameters together, meaning that coordinate descent is prone to getting 
stuck at suboptimal points. To see this, note that Qjk = Qkj appears in two constraints ||Oj||i < fit + p7 
and ||Ofc||i < /3t + /3jT. By contrast, the constraints in the Weak Hierarchical Lasso problem are blockwise 
separable so that blockwise coordinate descent on blocks of the form (Qj, f$J , /3~) for j = 1, . . . ,p does work. 
We begin by discussing our approach to solving the Weak Hierarchical Lasso problem. In Section |6.2| we 
discuss how we can solve a sequence of Weak Hierarchical Lasso problems that converges to a solution of the 
Strong Hierarchical Lasso. 

6.1 Solving the Weak Hierarchical Lasso. 

While blockwise coordinate descent would work for solving the Weak Hierarchical Lasso problem, we 
instead describe a generalized gradient descent approach. Given a problem of the form 

Minimize g(<j>) + h{<j>) , (10) 
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Algorithm 1 WEAK-HIERNET: Generalized gradient descent to solve Weak Hierarchical Lasso, ([T]), with elastic 
net penalty e. 

Inputs: X £ R nx P, Z G R"*^- 1 ), A > 0. Initialize (/3+(°),/H 0) , GK°)). 
For A; = 1, 2, . . . until convergence: 

Compute residual: fO" 15 «- y - - /H^" 1 )) - ZQ {k -^ /2. 

For j = 1, . . . ,p: 

«_ o NE ROW( ( 5/3+ (fc - 1) - tXjr^, 

where 0NER0W is given in Algorithm 3, 6 = 1 — te, and Zu .\ G ]R nx (P~ 1 ) denotes the columns of Z involving 
X r 



in which g is convex and diffcrcntiablc with a Lipschitz gradient and h is convex, generalized gradient descent 
works by solving a sequence of problems of the form 

4> k <- argmin^-ll^- [<j> k _ x - tVg(<j> k -i)]\\ 2 + h{4>), 
4> Zt 



where t is a suitably chosen step size (Beck & Teboulle 20091. These subproblems are easier to solve than 



( 10 1 since they replace g by a spherical quadratic. Under the previously stated conditions, generalized 
gradient descent is guaranteed to get within 0(l/k) of the optimal value after k steps; in fact, with a simple 
modification to the algorithm, this rate improves to 0(l/k 2 ) ( |Beck &: Teboulle| [2009 ) . Looking back at ([7|, 




we take g to be the differentiable part, q(j3 Q , (3 + — (3 , 6) + A1 J (/3 + + /3 ), and h to be the i\ penalty on 9 
and the set of constraints. The subproblem is of the form 

Minimize -||/3+ - + -||/T - /T l| 2 + ^||G- 6|| 2 F + ^||e]]i 
/3±eRp, OGR pxp it It It I 

for j = 1,... ,p, 

where (/3 + , /3~ , 0) depends on the previous iteration's solution and the data, X, Z, and y. The exact form 
of (/3 + ,(3~,Q) is given in Algorithm 1 for solving Q, where q((3o,(3,0) includes an elastic net penalty 
as described in Remark 2 of Section [2j The above problem decouples into p separate pieces involving 
(6j ,ly,/3j) that could be solved in parallel: 

Minimize i(# - £t) 3 + ~ {fij - $jf + i||0 i - ©J 2 + ^]|e^|| a (U) 
pfesL, e^GRf- 1 2t it it 2 

s-t. ||e j -||i</3 J + + /3 J -, /3+>0, pj >0. 



In Appendix |E[ we derive an algorithm, 0NER0W, that solves (11) based on the observation that in terms of 
an optimal dual variable, a, a solution is simply 0j = S[<dj, t(X/2 + a)] and jSj — \fij + ta] + . 

We solve ^ along a sequence of A values, from large to small, using the solution from the previous A 
as a warm start for the next. The WEAK-HIERNET algorithm gets within e of the optimal value of ^ in 
0(p 2 max{n,p}/e) time. 

6.2 Solving the Strong Hierarchical Lasso. 

In Section |6.1| we noted that each step of generalized gradient descent conveniently decouples into 
p single- variable optimization problems. However, for the Strong Hierarchical Lasso, the symmetry 
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Algorithm 2 STRONG-HIERNET: Solve ^ via ADMM. 
Inputs: X G R nxp , Z G M™*?^- 1 ), A > 0, p > 0. 

Initialize (/?+, /3",e), O, U. 

Repeat until convergence: 

1. WEAK-HIERNET(A, Z, A), but in the call to DNEROW replace the argument 58^* _1) - tZT ,f( fc_1 ) with 
dGf'V - tZTj .r^-V + p[0f~ 1] ~ Qj] + Uj. Also, initialize with 0+J~,Q). 

2. 6<- i(e + e T ) + ±{u + u T ). 

3. f7<- U + p(Q- h). 



constraint ties all variables together. We therefore make use of Alternating Direction Method of Multipliers 
(ADMM), which is a very widely applicable framework that allows convex problems to be split apart into 
separate easier subproblems ( Boyd et al.||2011 ). 



Given a convex problem of the form Minimize^ /(</>)+g(</>), we rewrite it equivalently as Minimize^ /(</>)- 
g((f) s.t. <j> = ip, and then the ADMM algorithm repeats the following three steps until convergence: 

1. = argmin [/(<£) + (p/2)\\cf> - £ + u/p\\ 2 ] . 



if = argmm 



ff(v) + (p/2)||v-0-«/pll 



3. u u + p{4> — <p)- 

Thus, the ADMM algorithm separates the two difficult parts of the problem, / and g, into separate opti- 
mization problems. The dual variable u serves to pull these two problems together, resulting in an algorithm 
that is guaranteed to converge to a solution as long as p > 0. In practice, the value of p affects the speed of 
convergence. 

In our case, we use ADMM to separate the hierarchy constraints, involving (/3 + , f3~ , 0) from the symmetry 
constraint, which will involve a symmetric version of O, which we call f2: 

Minimize q(p , f3+ - /T, 9) + \l T (f3+ +/T) + heh (12) 



2 



T ll%||i< ft + 07 1 

s.t. n = n T ,e = n ' 3 3 Ur? = i,...,p. 

The resulting ADMM algorithm is given in Algorithm 2, which is explained in greater detail in Appendix 
[G] Conceptually, the algorithm alternately updates two matrices, & and f2. Throughout the algorithm, 
we update O by solving a version of problem ([7|, and we update f2 by symmetrizing a version of 0. At 
convergence, = f2, and thus is both symmetric and satisfies the hierarchy constraints. 

7 Discussion. 

In this paper, we have proposed a modification to the Lasso for fitting strong and weak hierarchical 
interaction models. These two approaches are closely tied and our algorithms to solve the two exploit their 
similar structure. A key advantage of our framework is that it admits a simple characterization of the 
effect of imposing hierarchy. We compare our hierarchical methods to the Lasso and stepwise procedures to 
understand the implications of demanding hierachy. We introduce a distinction between models that have 
a small number of parameters and those that require measuring only a small number of variables. The 
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hierarchical interaction requirement favors models with the latter type of sparsity, a feature that is desirable 
when performing measurements is costly, time consuming, or otherwise inconvenient. The R package hierNet 
provides implementations of our strong and weak methods, both for Gaussian and logistic losses. This work 
has potential applications to genomewide association studies. In future work, we intend to extend this 
framework to contexts in which only certain interactions should be considered such as in gene-environment 
interaction models. 
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Appendix 

A Proof that pi) and ^ are equivalent. 

We rewrite ^ in terms of /3 = /3 + — f3~ rather than f3~: 

Minimize q(/3 ,P, 8) + Al T (2/?+ - /3) + ~||9|| a 

/3 eR, /3,/3+eR p , eeiRpxp 2 

s.t. 8 = 8 T , /3+ > 0, 0+ > ft 118,11! < 2ft+ - ft. 

or 

Minimize q((3 ,(3, 8) + Al T (2/3+ - 0) + £||0||i 

s.t. e = e T , max{[/y + , (||e j ||i + ft)/2}<ft+ 

where [ft] + = max{ft, 0} is the positive part of ft. This problem is the epigraph form of 

p \ 
Minimize q(p , (3, 8) + A V(max{2[ft] + , ||8, || 1 + ft } - ft) + - ||e||i 

3=1 

s.t. e = e T 

which reduces to (|6| since 2[ft] + — ft = |ft|. 

B Effect of constraint. 



For notational simplicity, we write r(f3 + ,f3 ,8) G R ra to denote the residuals, y — y((3 + ,j3 ,8), as a 
function of the parameters. The strong Hierarchical Lasso problem is the following: 

minimize^- , e , 0~ , 8)|| 2 + Ail T 03+ + /T) + A 2 ^W^j ||i 

j 

s.t. ||8j||i < ft+ + ft" and ft + > 0, ft" > for each j, 8 = 8 T 

The Lagrangian is 

L{fr a, S, 7 ± , CO = ^||K/3 + ,r , e)|| 2 + A!l T (/3+ + /T) + A 2 (E7, 8) • »^ r / B ' " ^ + " ^ 

i 

-7M t -% 7 ^ 7 + < 5 ' - 0T ) 

= i||r(/3+,/?-, e)|| 2 + (Ail - a - 7 + ) T /? + + (Ail - a- ^ P~ 
+ (S-S T + diag(A 2 l + a)!/, 8). 
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where a, 7 ± , S, U are dual variables. According to the KKT conditions, (</>; d, S, ^ , U) is an optimal primal- 
dual pair if and only if 

±xjr0+J-,Q) = \ 1 -a j -jf 
( Xj * x fe ) T r(/3+, /T , 0)/2 = (A 2 + &j)U jk + S jk - S kj 
= PfT? O = & j (\\Q j \\i-0+-0j) 

e = e T , $ ± >o, \\Qj\u <ft + pj 0,7* >o 

fj = |sign(§j fc ) Q jk ^ 

jk \e[-i,i] e jfe = o. 

Now, letting r( _J ) = r($ + ,$~ , 9) + ($j~ — $J)xj and recalling that ||a;j|| 2 = 1, there are three cases to 
consider: 

1. > 0, = 0: 

xj(r^ - 0+ Xj ) = Ai — a,- - 7+ => /3+ = [xjr^ - (X 1 - &,•)]+ 
Note that this case applies when xjr^~^ > Ai — dj. Thus, in this case — = S(xjr^~ : '\ Ai — dj). 

2. /3+ = 0, /3- > 0: 

-a;J(r<-'> + Z^) = Ai - - 77 /?7 = [-zJr<-'> - (Ai - &,-)]+ 

Note that this case applies when xjr^~^ < — (Ai — dj). Thus, once again $j~ — = S(xjr^~ : >\ \± — 
otj). 

3. $+ > 0, 07 > ( => 7+ = 0, 77 = 0) 

±a;J(r<-'> - 0+ - $r) Xj ) = Ai - d,- => $+ - $7 = xjr^ 
Note that this case applies when otj = Ai, so trivially ftf — $7 = S(xJr^-~ : '\ Ai — dj). 

Thus, we have shown that fif — $7 = S(xjr^~ :i \ Ai — otj). 

We can get rid of S by rewriting the subgradient equation involving it as 

(xj * x k ) T r0+J-,e) = (2A 2 + &j + a k )U 3k 

(note that symmetry in implies that there exists a symmetric U). 
Now, letting r^V = r(/3+, 6) + { Xj * x k ){Q jk + Q kj )/2, we get 

QjkWxj * x k \\ 2 = ( Xj * x k ) T r(- jk) - (2A 2 + otj + a k )U jk = S((xj * x fe ) T r ( ^ fc) , 2A 2 + a,- + a k ). 

This completes the proof for the Strong Hierarchical Lasso. Note that in the Weak Hierarchical Lasso case, 
the KKT conditions arc identical except we do not have the constraint O = T and we take 5 = 0. Thus, 
the relevant condition is simply 

{Xj * x k ) T r0+J-,Q) = 2(A 2 + 6tj)U jk = 2(A 2 + a k )U kj . 

Note that the second equality implies that Uj k U k j > (since a > 0) and that if \Uj k \ = 1, then otj < a k and 
vice versa. Rearranging terms, we have 

(Qjk + @kj)\\xj * x k \\ 2 /2 = { Xj * x k fr { -' jk ^ - 2(A 2 + aj)U jk 

= (Xj * x k ) T r(- jk) - 2(A 2 + a k )U kj 
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Now, UjkUkj > implies QjkQkj > which implies that (Qjk + Qkj)/2, if nonzero, has the same sign as 
whichever of Qjk or Qkj (or both) is nonzero. 
There are four cases: 

1. Q jk + 0, Q kj = 0: 

(Qjk + e kj )\\xj * x k \\ 2 /2 = ( Xj * x k fr^ - 2(A 2 + &j) ■ sign(9 ifc ) 

= (xj * x k ) T r^ jk) - 2(A 2 + a,-) • sign(9 3fc + Q kj ) 

and < a k since |fj/s| = 1. 

2. 6 ife = 0, Q kj + 0: 

(e 3 fe + Qkj)\\xj * x k \\ 2 /2 = (xj * x k ) T r(- j V - 2(A 2 + a k ) ■ siga(Q kj ) 

= (xj * x k ) T r(- j V - 2(A 2 + a k ) ■ sign(9 jfc + Q kj ) 

and a k < Aj since \Ukj\ = 1- 

3. e ife ^ o, e fcJ ^ o: 

(Qjk + Qkj)\\xj * x k \\ 2 /2 = ( Xj * x k fr { -^ - 2(A 2 + &j) ■ sign(9 ife ) 

= (xj * x k ) T r { -^ - 2(A 2 + aj) • sign(9 jfc + Q kj ) 

and 6ij — dfc since \Uj k \ — \U k j\ = 1. 

4. e ife - o, e fcJ = 0: 

(e ifc + e fcJ )||^- * x fc || 2 /2 = o 

= S((xj*x k ) T r^ k \ 2(X 2 + a j )) 
= S(( Xj *x k fr^ k \ 2(X 2 + a k )) 

where the latter two equalities follow since \(xj * Xk) T r(~i k } \ < 2(A 2 + otj) and \( X j * Xk ) T r^~^' \ < 
2(\ 2 + a k ) 

We can encapsulate all of this into a single, simple expression: 

(Qjk + Qkj)\\xj * x k \\ 2 /2 = S(( X j * x k ) T r { - 3k \ 2(A 2 + mm{a 3 ,a k })) 

C Proofs of strong and weak hierarchy. 

We begin by proving Lemma 1, which characterizes all solutions to j5|as a relatively simple function of 
y. The structure of our proof is based on (Ryan) Tibshirani & Taylor (2011 2012). 

C.l Characterizing the solution. 

For ease of analysis, we write Q equivalently in terms of G + and O - . Also, for notational simplicity, we 
write <fi — (f3 + , j3~ , 9 + , _ ) and X = (X; — X; Z/2; — Z/2). The strong Hierarchical Lasso problem is the 
following: 

minimize^ ,(,-,&+, e- \h ~ X<t>\\ 2 + Xil T (P + + P~) + A 2 (11 T , 0+ + 9") 

s.t. 1 T (9+ + 97) < pj + pj and pf > 0, pj > for each j, 
Q+-Q- = Q +T - Q- T , Qf k > 0, Qfj = 
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In introducing @ ± , we are not in fact changing the problem since at a solution ± = max{±9,0} (for 
A2 > 0). To see this, note that given any feasible point with 0+ fe > and 0~ fc > 0, we can produce a feasible 
point with strictly lower objective by reducing Q^ k , ©~ fc , ©t-, ©^ all by equal amounts. 

We will try to make this as close in form and notation to (Ryan) Tibshirani & Taylor (2012 1 's treatment 
of the generalized lasso problem as possible. Our optimization problem is of the form 



minimize^ - \\y — X<fi\\ 2 + w T <f> s.t. D(f>>0, L<f> = 0. 



(13) 



The Lagrangian of this problem is 



L(<fr,n,u) = - Wy-Xtf+w't-fi 1 D<j) + V T L(t>, 

where ji > and v are dual variables. The KKT conditions for (4>(y), (fj-(y), v(y))) to be an optimal primal- 
dual pair are the following: 

X T (y - X(j>) = w- D T fi + L T v 

A > 

D<j> > O,L0 = 0. 

Now, define the "boundary" and "active" sets as 

B(fi) = {i : h = 0} 
A($) = {i : [D^ > 0}. 

These are not necessarily unique since ((f), (fi,v)) may not be unique. In terms of the active set A((f>), the 
KKT conditions become 



+ L T £> 



L$ = D 



= 0. 



Solving for <j), we get the following characterization of a Strong Hierarchical Lasso solution: 

Lemma 1. Suppose cf> is a solution to the Strong Hierarchical Lasso problem ^ (taking g(/3o,/3,©) to be 
the quadratic loss) with A((j)) = {i : [D<p]i > 0}. Then, <f> can be written in terms of A((f>) and y as 

9 = (X -Pnull(L)nnull(D _ M l ) )) + (V — (^null(L)nnull(D_ ^. 7, ) 

X I )+w) + b, 

where b G null(X) n null(i) (1 nvJ\(D_ A ,u) satisfies 

A[(^-Pnull(L)nnull(D__ 4( - 4) )) + (2/ - (-PnuU(i)nnuU(D _ A( ft)X T ) + w ) + b ] > 

for all i e A(4>). 



Proof. Defining D 



D 



i 4 ^) ) and P = P 



null(D) 



P 



null(L)nnull(D_^ (i j ) ) 



we solve for d> in the same manner 



as is done in (Ryan) Tibshirani & Taylor (2012). Since D<f> = is equivalent to P<f> = <f>, we have PX T (y 
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XP4>) = Pw^ We see that Pw £ co\(PX T ) and thus Pw = {PX T )(PX T )+ Pw. Thus, PX T XP$ = 
PX T (y — (PX T ) + Pw) from which we get 

$ = (XP) + (y - {PX T )+Pw) + b, 

for b £ null(XP) and such that Db = and Di<j> > for i £ A((f>). To complete the result, we observe that 
the first two conditions reduce to b £ null(X) n null(-L) n mi\\(D_ A ^). □ 

We will use this characterization of a solution both to prove that the hierarchy property holds with 
probability one under weak assumptions and to derive an unbiased estimate of the degrees of freedom. 

~L^' S ) morc cx Pli c itly an( i introduce a little notation that will be 

useful later. Every row of D corresponds to an inequality constraint, and we can partition these rows into 
ten subsets: 



Before we do so, we write out D = 



C = {j:l T (Q+ + ej)<$++$7} 

V{fi±) = {j : pf > 0} 

± 
3k 



p(e±) = {j * k ■. e± > o} 



T = C C 
Z0 ± )=V0 ± ) c 
Z(& k )=V(& k ) c 



(14) 



The set A{&) c is made up of T, Z0+), Z0~), Z(Q+), and Z(Q~). The matrix D has 2p + 2p 2 columns 
that can be partitioned as (D@ : D@ : D B : D e ) and a row for every constraint. The rows of this matrix 
are the following (where ej and l p are row vectors): 



Rl. 


For each j £ T, 


[ e i 


e i 


—ej <g> l p 


-ej ® l p 


R2. 


For each j £ Z(J3+), 


{ e 3 











R3. 


For each j £ Z(j3~), 


' o 


e i 








R4. 


For each jk £ 2(6+), 


; o 





ej <g) e& 





R5. 


For each jk £ Z(Q~), 


; o 








ej ® e k 


R6. 


For each j, 


; o 





e j ® e j 





R7. 


For each j, 


; o 








e 3 ® e 3 


R8. 


For each j < k, 


; o 





ej ®eu + ek® e 3 


—ej ® efc — efc ® 



We will refer to this in the proofs that follow. 
C.2 Proof of strong hierarchy (Property [2). 

Including the elastic net penalty, (e/2)||/3+|| 2 + (e/2) |/3* 
and y in ( 13 1 by 

/ X -X Z/2 -Z/2 \ 
y/el p 

y/elp 

\ o o v~ eIp2 -^i p2 J 



(e/2)||©|| p, is equivalent to replacing X 



X f = 



and y e 



y 

(2p+p 2 )xl 



Suppose we solve (13 1 with the above design matrix. By Lemma [T] 

4> = {X e Pnul\{L)nnu\\(D _ A( ^)) + {V ~ (-Pnull(L)nnull(D _ A{i) )^T) + w ) + & i 

for some b £ null(X e ) H null(L) n null(D_^/^) satisfying 

A[(^null(L)ni 1 ull(I5__ 4( ^ ) )) + (y - {Pnu\\{L)nnn\\(D_ Mi) )X T ) + w)+b] > 

for all i £ A((f>). Let S v : R 2 p +2 p —> RM be the linear operator that selects the part of a vector corresponding 
to the variable v. Now, b £ null(X e ) implies that Sp+(b) — Sp-(b) = and SQ+(b) = —S®-(b). We showed 
earlier that we cannot have <d^ k > and Qj k > 0. This means that for any jk, there must be an i £ A(<f>) 
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for which Di<f> = corresponds to S B + ((f)) = or S Q - (</>) = 0. Thus, D_. ( i^b = means that 5' e + (6) = 

jk jk •^■\ ( P) jk 

or S B - (b) — for each jk. This implies that null(X e ) n null(D_^^^) = {0} and thus b = 0. 
We show now that P (\Jj{fif — fij > 0}^ = 0. In terms of our above notation, this is 

P |lJ{^ T (4 - e J) = 0. t T 4 > 0, $ T ej > 0}j = 0, (15) 
where G IR 2 p+ 2 p is the vector with all zeros except for Sp±(e^) = 1. Consider the set 



Af = LU U? =1 {z :[(X e P null ( i)nnuU ( D _ A )) + (z - (P n nlKL)nnn\l{D_ A )Xj) + w)] T (ej - ej) = 

[(^e-Pnull(L)nnull(Ll_ A )) + ( 2: ~ (Pmn(L)nnull(r>_.4)^J) + u ')] Te j h > 
[(X e P nu \ l ( L - )nnun ( D _ A - ) ) + (z - (P lm \i(L)nmii\(D- A )X'[) + w)] T eJ > 0}, 



In light of (14 1, fixing A automatically specifies T, V(j3 ),"P(0 ). The outer union is restricted to those 



subsets A of 2p + 2p 2 elements that would have P(9+) H 7>(9~) = and V((3 + ) H P(/3~) C T. The event in 



(151 is contained in {y e € Af} since it corresponds to the case in which A is A(<f>). We begin by showing that 
A(4>) is in this restricted union with probability one. We have already argued that at a solution we must 
have (-) .', (-) ,. = for all jk. Now, /3+ > and f}J > together imply that 1 T (6>+ + &j) = $f + /3j since 

otherwise we could lower the objective by reducing {5~j and (3~ without leaving the feasible set. Therefore, 
it would be sufficient to show that P(y e € Af) — 0. We do so by observing that {y e € Af} is a finite union of 
zero probability sets. 

We begin by establishing that null(/i:„(P null(L)nnuU (D_ A )^ e T ) + ) = null(XP nu n (L)nnuU(D _ A) ). To do so, 
we write P — P n ull(.L)nnull(r>_, A ) as P = UU T for some U T U = I. Now, row(XP) C row(P) = co\(U), 
so we can write U — (U\ : U2), where row(XP) = col(fJi) and thus XPU2 = 0. Since PU% = U2, it 
follows that XU 2 = 0. Write Uf = (C/f +T : Uf T : C/f +T : U?~ T ) for i = 1,2, and observe that 
UTXJ = [UfXT : ^uf T : ,/El/f " T : ,/i(E/f +T - V?"*)\ so that 

uTxZxm = + e[ut T uf + uf T uf + (C/f + - E7? _ )(l7 2 e+ - Ui')} 
= e{U?U 2 - Uf +T Uf - Uf~ T Uf + ] 

= - e [o + Y^ u i + u u 2~hk + [urwutu- 

jk 

Now, for each jk we must have jk £ Z(Q + ) U Z(Q~) since P(8 + ) n P(8~) = 0. Thus, for each jk, there 
is a R4 or R5 row in D-a and thus D-a(Ui : U 2 ) — implies that [[/f + ]jk — or [[7® ]jk = for each jk 
and likewise [f/ 2 e+ ] jfc = or [Uf~] jk = 0. Therefore, UlX^X e U 2 = 0. Now, 

" rT \+ — Y T>( PY T Y P^+ — YTTTT T ITTTT t Y t y TTTT t \ 



h:n{PX; ) = XP(PX^ X e P) + = XUU 1 (UU 1 X\ X.UU 1 ) 

rT 

u 



XUU T U{U T XjX,U) + U T = XU(U T X^X e U) + U T 



Yfjj m ri ^ X € Ui U 1 X t XJJ2 

x(u ' lh> [unU.Ut v?xrx,u. 



= xu 1 {ufxjx t u 1 ) + u^. 
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Now, U?X?X £ Ui >- since U?XfX e Ui = U^[X T XU 1 + J and J b and XU X has full column rank. Thus, 
null(Ji :n (PX e T )+) = n\i\\{Uj) = null(XP). This completes the first part of the proof. 

Next, we show that XP(ej - e 3 — ) ^ as long as (X e P) + (y e - (PX^) + w)] T ef > 0. Now, since 
j € n7 5 (/3+) n V{(3~) C T and so the only row of D that has A(e^) 7^ is the Rl row; but clearly 
A(et - ej) = 1 - 1 = 0. Thus, e+ - e null(5) and P(e+ - ej) = ef - ej. It follows that 



XP(ef-ej) = 



( 2^ \ 

V / 



7^ assuming e > 0. 



Putting these two parts of the proof together establishes that Ii- n (PXj) + (e^ — ej) 7^ 0. Thus, {y e € A/"} 
is a finite union of Lebesgue measure sets. This shows that P(y e G A/") = as long as y is absolutely 
continuous with respect to the Lebesgue measure on K". 
C.3 Proof of weak hierarchy (Property pi). 



An argument nearly identical to that of the previous section establishes that 6^, 7^ ==>■ f3j~ — jij 7^ 
with probability one. Thus, if f$J — $j = and p£ — f3j = 0, then both 8^ = and ®kj — 0. It follows 
then that (Qjk + ©fcj)/2 = 0. This establishes weak hierarchy. 

D Degrees of freedom. 

D.l Proof of unbiased estimate (Property H). 

The fit in terms of the active set is given by 

X(f> = (^Pnull(L)nn U ll(D_ A(if) ))(^-Pnun(L)nnull(Z5_^ (i)) )) + (2/ ~ (Pnutt(L)nnull(D _ A(}) ) X T ) + w) 

Of course, A^(^) — an d we can solve the KKT conditions to get the rest of the optimal dual variables 
in terms of the active set: 

A-^A = pT+ [w _ ^T (y _ ^)] + c 



where c £ null(£>) satisfies D T+ [w ~ X T (y~ X$)] + c > 0. 

Note that (j> — 4>{v) an( i thus A{4>) and b depend on y even though we don't write this explicitly. We will 
continue writing cj> to mean specifically 4>{y). For y' in a neighborhood of y, we might guess that <j){y') = f(y') 
and (fi(y , ),u(y')) = (g(y'),h(y')), where 

f(y') = (XPnun(L)nnull(D_ A ^)) + (y' - (Pnnll(L)nnull(D _ A{i) )X T ) + w) + b 
T-\- 

To verify this guess, we need to check that the pair (f(y'),{g(y'),h(y'))) satisfies the optimality conditions 
at y'\ 

X T {y> - Xf(y')) =w- D T g{y') + L T h(y') 
9i(y')(Df(y')) i = Q 

g{y') > 0, Df(y') > 0, Lf(y') = 0. 
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First of all, Lf(y') = holds since L(XP nnll{L)nnull{D _ A(J>)} )+ = and Lb = 0. Likewise, D_ A{ ^f(y') = 0. 
Now, D A ^f(y) > so by continuity of /, we have D A ^f(y') > for all y' in a small enough neighborhood, 

Ui, of y. This establishes that A(f(y'j) = A{4>). Now, g(y') A ^ = 0, so complementary slackness holds. To 
see that the first optimality condition holds, we can simply plug (g(y'), h(y')) into the left hand side. All 
that remains is to show that g(y')_ A ^ > 0. If we knew that A_^i(^) > 0j then by continuity of <? we could 
argue that over a small enough neighborhood, U2, 9{y')-M^\ > 0- However, it could be the case that fi. L = 
for some i ^ A((j>), i.e. i G B{(f>) \ A(4>). Nonetheless, one can show that there is a set N of measure for 
which y £ N implies that A((j>(y)) = A{<j){y')) and B{(p{y)) — B((j)(y')) for all y' in a neighborhood of y. 



Lemma 9 of (Ryan) Tibshirani & Taylor (2012) proves this result for a nearly identical situation. 

The fit X(j>(y) is a piecewise affine function of y. Using Stein's formula for the degrees of freedom (as 
described in Ryan Tibshirani & Taylor 2012), we get that 



df{Xcj>) = E[V ■ Xcj>(y)] = E[tr{(XP)(XP) + }} = E[mnk(XP)}, 
where P = P nu ii( L )nnu\i(D_ Mi) )- 

D.2 Proof of bound on estimate (Property [5]). 

We bound this by an estimate that is more interpretable: rank(XP) < rank(P) = nullity [ n 

Clearly, R2-R7 are linearly independent rows. Thus, the rank of D is at least |Z(/3 + )| + |Z(/3~)| + 
|Z(0 + )| + |i?(0 - )| + 2p. Now, an Rl row is linearly independent of R2-R8 precisely when j e T has 
j E Z((3 + )AZ(j3~). To see this, note that if j € Z(f3 + ) \ Z((3~), then Rl is certainly linearly independent 
of R2-R8 and likewise for j e Z(P~ ) \ Z(/3+); however if j € Z{(i+) n Z(/3~) then jk G Z(6+) n Z{®~) 
for all k G {1, . . . ,p} \ {j} and therefore this row of Rl lies in the span of R3-R7. Thus, this means there 
are |T\ (Z(f3 + )AZ(f3~))\ additional linearly independent rows. Finally, we consider R8. Clearly, R8 lies 
in the span of R4-R5 for jk G Z(Q+) n Z{Q~) since jk £ Z(G + ) kj G Z(Q+) at a solution. But if 

jk G V(0 + )UV(0-), then it is linearly independent of R1-R8. Therefore, R8 adds \V(Q + )\/2 + \V(Q-)\/2 
to the rank where we have used that V(Q + ) fl ^(B - ) = at a solution (since A2 > 0) and recalling that 
j < k for the rows of R8. In summary, we have shown that the row-rank is 

\Z(0+)\ + \Z(p-)\ + \Z(Q + )\ + \Z(Q~)\ +2p+\T\ (Z{p+)AZ(/3-))\ + |7>(8+)|/2 + |P(G")|/2. 

Recalling that there are 2p + 2p 2 columns, we get that 

nullity [ n L ) = (2p + 2p 2 ) - rank [ n 

= (p - \Z((3+)\) + (p- \Z(p-)\) + ( P (p - 1) - \Z(Q+)\) + ( P (p 1) - l-H(G-)l) 

- \T\ (Z(f3+)AZ(f3-))\ - W(Q + )\/2 - \T(Q-)\/2 
= + \V(P~)\ + \V{& + )\/2 + \V{®-)\/2 - \T\ (V(p+)AV(/3'))\ 

E Solving the prox function. 



The Lagrangian of (111 is given by 



- ( 7 + + a)f3+ - (7- + a)0- 

where a is the dual variable corresponding to the hierarchy constraints and 7 ± are the dual variables 
corresponding to the non-negativity constraints. For notational convenience, we have written 8 for Qj and 



24 



we have dropped the subscripts on /3 ± . The KKT conditions are 

0* - ( S ± )/t-7 ± - a = (0-0)/*+ (A/2 + a)u = 
£±^ = a(\\6\\ 1 -p+-p-) = 
^>0, 11% </3+ + /T a>0, 7±>0, 

where itj is a subgradient of the absolute value function evaluated at 0j . The three conditions involving 7 
implies that /3 = [(3+- + ta] + . The stationarity condition involving 6 implies that 9 = S(6, t(X/2 + a)). 
Now, define /(a) = \\S(6, t(X/2 + a))||i - [j3+ + ta} + - + ta]+. The remaining KKT conditions involve 
a alone: af(a) = 0, f(a) < 0, a > 0. Observing that / is non-increasing in a and piecewise linear suggests 
finding a as done in Algorithm 3. 



Algorithm 3 DNER0W: Solve (11) via dual 



Inputs: 13+ e R, <d e E p_1 , A > 0. 

1. Find a. Define /(a) = ||5(9j, *(A/2 + a))||x - 0+ +ta] + - \fij +ta]+. 

(a) If /(0) < 0, take a = and go to step 2. 

(b) Form knot the set V = {\@jk\/t - V 2 }*=1 u {-/^V*}, and let P+=Pn [0,oo) 

(c) Evaluate /(p) for p G V + . 

(d) If /(p) = for some p E V + , take a — p and go to step 2. 

(e) Find adjacent knots, pi,P2 € such that /(pi) > > f{p2)- Take 

a = -f(Pi)[f(P2) - f(pi)]/(P2 - Pi)- 

2. Return Qj = S[Qj, t(X/2 + a)] and $f = 0f + ta]+. 



F Solving the logistic regression problem. 

For notational simplicity, in this section we use X and 4> to denote the full data matrix and parameter 
combining both main effects and interactions. The binomial negative log-likelihood is 

n 

l{p , (/>) = - J2bji logPi + (1 - Vi) log(l - Pi)] 



where p l = p l ((3 , 0) = 1/(1 + e^ "** ^). Now, 



d(3 



-i'(y-p) 



V^{^4>)^-X T {y-p). 



Thus, to solve min^ Oj £(f3o, 4>) + h(c/)), we can use generalized gradient descent, which iteratively solves 



(/f \<^)) = argmin- 
Po,4> At 



A) 



(fc-i) 



(fc-i) 



+ * 



X T \y-P0t 1 \4> (k - 1) )] 



■h{cj>) 



This separates into two parts: 

^* , =^*- 1) +tl T |»-p(^ fc - 1) ,^- 1) )] 

0« = Prox 2 ,, + tX T [y - p$*~ r \ ^ (fc " 1) )l) 
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where Pr0x2t.fr refers to the minimizer of (111. Looking at Algorithm 1, we see that this algorithm is 
identical except that for each k we update the estimate of the intercept and that we compute the residual 
as y —p((3o,(j)). The "difficult" part of the computation is identical! 

G ADMM for Strong Hierarchical Lasso. 

The ADMM algorithm has three parts: 
1. Update (/3o, P ± , O) by solving 

Minimize q((3 ,P + - /T,e) + Al T (/3+ + /T) + £||0||i 



tr[tf(e-n)] + (p/2)||e-n| 



s.t. $f > 0, pj > for j = 1, . . . ,p. 

As with Algorithm 1, we may apply generalized gradient descent and DNEROW to solve this, but replacing 
the argument of ONERDW with 5ty k ~ 13 - tZT ,f ( fe_1 ) + p(ef~ 1} -Q) + U. 

2. Update fi by solving 

Minimize tr[f/(6 - fi)l + (p/2)||6 - s.t. fl = fl T . 

neR pxp 

This has the analytic solution ^- |(G + 9 T ) + i(C7 + C/ T ). 

3. Update U ir- U + p(B - Q): 

Algorithm 2 in the paper gives the full algorithm. 
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